spec(012) Phase 1: structured action items + most-recent-verdict acceptance gate by jeremymanning · Pull Request #198 · ContextLab/llmXive

jeremymanning · 2026-05-18T11:58:38Z

Summary

Phase 1 of spec 012 (paper review convergence). Implements ~25 of 55 tasks: the foundational schema work, the most-recent-verdict acceptance gate, severity-based routing, and the arxiv-intake guardrail. The remaining ~30 tasks (auto-plan revision pipeline, re-review protocol consumer logic, integration, and polish) ship in follow-up PRs.

What this PR enables

The four already-passing arxiv-intake papers (PROJ-564 / 565 / 566 / 576) can now reach PAPER_ACCEPTED on the next paper-review cron tick — the all-accept gate is what was blocking them.

Fatal-severity action items route the project to BRAINSTORMED with a rejection rationale automatically appended to the idea record. PROJ-578's "GPT-5.4 / Claude Sonnet 4.5 / Gemini-3.1-Pro are unverifiable" finding would land it here (once its reviews are re-emitted under prompt_version 1.1.0).

Arxiv-intake papers (third-party, frozen source) can never trigger a writing/science revision pipeline against paper/source/ — instead the consolidated action items land in projects/<PROJ-ID>/upstream_feedback.yaml.
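
The routing behavior listed above can be sketched roughly as follows. This is a minimal illustration, not the code in advancement.py: the severity ordering (writing < science < fatal), the function names, and the returned route labels other than BRAINSTORMED are assumptions made for the example.

```python
from typing import Iterable, Optional

# Assumed ordering for illustration: a higher rank is a more serious concern.
_SEVERITY_RANK = {"writing": 0, "science": 1, "fatal": 2}


def max_severity(severities: Iterable[str]) -> Optional[str]:
    """Return the most serious severity present, or None when there are none."""
    found = list(severities)
    if not found:
        return None
    return max(found, key=_SEVERITY_RANK.__getitem__)


def route(severities: Iterable[str], arxiv_intake: bool) -> str:
    """Map consolidated action-item severities to a hypothetical route label."""
    worst = max_severity(severities)
    if worst is None:
        return "no-action-items"           # nothing to route on
    if worst == "fatal":
        return "BRAINSTORMED"              # rejection rationale goes on the idea record
    if arxiv_intake:
        return "record-upstream-feedback"  # never touch paper/source/
    return f"{worst}-revision"             # placeholder label for a revision stage
```

For example, `route(["writing", "fatal"], arxiv_intake=False)` yields `"BRAINSTORMED"`, because a single fatal item outranks everything else under the assumed ordering.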

Scope (what's IN this PR — ~25 tasks)

T001-T009: Schema (Stage enum, ActionItem, ReviewRecord extension, shared snippet, unit tests)
T010-T013 (partial): prompts emit action_items; paper_reviewer.py parses them
T014-T017: most-recent non-stale verdict gate; legacy point-threshold dropped
T018-T021: severity-based routing; BRAINSTORMED + rejection rationale for fatal
T040-T045: arxiv-intake guardrail (upstream_feedback.yaml, is_arxiv_intake, append_rejection_rationale)
T051 (registry version bump)
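
As an illustration of the ActionItem schema work, stable IDs could be derived along these lines. The names `canonicalize` and `action_item_id` and the sha1[:12] rule come from this PR's commit message; the regular expression and the exact normalization steps are assumptions, so the real canonicalization may differ.

```python
import hashlib
import re

# Assumed pattern: references such as "Section 3.2", "Fig. 4", "Table 1", "Eq. (7)".
_REF = re.compile(
    r"\b(?:section|sec\.?|figure|fig\.?|table|tab\.?|equation|eq\.?)"
    r"\s*\(?\d+(?:\.\d+)*\)?",
    re.IGNORECASE,
)


def canonicalize(text: str) -> str:
    """Drop section/figure/table/equation references, lowercase, squeeze whitespace."""
    without_refs = _REF.sub(" ", text)
    return " ".join(without_refs.lower().split())


def action_item_id(text: str) -> str:
    """Return the first 12 hex characters of the SHA-1 of the canonical text."""
    digest = hashlib.sha1(canonicalize(text).encode("utf-8")).hexdigest()
    return digest[:12]
```

Under this sketch, "Clarify the baseline in Section 3.2" and "clarify the baseline in section 5" canonicalize to the same string and therefore share an ID, which is the "same concern → same ID" property the test plan checks.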

Deferred (~30 tasks for follow-up)

T022-T034 (US2/US3): revision_planner.py — the 5-stage subprocess driver that auto-runs speckit-{specify,clarify,plan,tasks,analyze} for revision specs. This is the biggest unbuilt piece; needs ~500 LOC + real-call tests.
T035-T039 (US5): paper_reviewer.py wiring of the shared rereview snippet when prior reviews exist for THIS specialist. The snippet itself (agents/prompts/_shared/rereview_block.md) ships here; the consumer is the follow-up.
T046-T050: scheduler idempotency, llmxive project unblock CLI, full-cycle e2e real-call test
T052-T053: web dashboard rendering of PAPER_REVISION_IN_PROGRESS / READY_FOR_IMPLEMENTATION / PAPER_REVISION_BLOCKED badges, README update

The advancement evaluator still routes legacy verdicts (prompt_version 1.0.x records with no action_items) through the pre-spec-012 _winning_recommendation path so existing projects don't regress while reviews are gradually re-emitted under 1.1.0.

Test plan

39 new unit tests added; full unit suite (451 tests) passes
Schema canonicalization verified (Section/Figure/Table/Equation refs absorbed; same concern → same ID)
arxiv-intake detection unit-tested (metadata.json present + specs/ absent → True)
Back-compat: legacy records (prompt_version 1.0.x with no action_items) load + route correctly
Real-call test: next paper-review cron tick on PROJ-564 — verify it reaches PAPER_ACCEPTED after specialists are re-prompted under 1.1.0
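
The arxiv-intake detection rule from the test plan can be sketched as below. The function name `is_arxiv_intake` is from this PR; the assumption here is that `metadata.json` sits at the project root and that the absent directory is `paper/specs/`, as the commit message states.

```python
from pathlib import Path


def is_arxiv_intake(project_dir: Path) -> bool:
    """True when metadata.json is present and paper/specs/ is absent."""
    has_metadata = (project_dir / "metadata.json").is_file()
    has_specs = (project_dir / "paper" / "specs").is_dir()
    return has_metadata and not has_specs
```

A project with neither file is treated as not arxiv-intake in this sketch, since the rule requires the metadata file to be present.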

🤖 Generated with Claude Code

Updates the "How it works → The paper pipeline" section to describe the spec-012 convergence pipeline (structured action items, most-recent verdict gate, three-way severity routing, per-specialist re-review protocol, and arxiv-intake guardrail). Closes the last remaining task in the spec-012 task list (T053). With this commit, all 55 of 55 tasks are now landed on PR #198. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ptance gate

Implements the convergence-pipeline foundation for spec 012:

SCHEMA (T001-T009):
- New Stage enum values: PAPER_REVISION_IN_PROGRESS, READY_FOR_IMPLEMENTATION, PAPER_REVISION_BLOCKED. Added to project-state.schema.yaml + lifecycle ALLOWED_TRANSITIONS (additive; old transitions retained for back-compat).
- New ActionItem pydantic model (id, text, severity ∈ {writing,science,fatal}). Stable IDs derived from canonicalize(text) → sha1[:12]; canonicalization absorbs section/figure/table/equation refs + casing.
- ReviewRecord gains action_items field (default []). Validator: non-accept verdicts under prompt_version >= 1.1.0 MUST include ≥1 action_item. Legacy 1.0.x records are grandfathered.
- Project gains revision_spec_path field for the READY_FOR_IMPLEMENTATION flag.

PROMPTS (T010-T011):
- agents/prompts/paper_reviewer.md (lead) + 12 specialist prompts updated to emit action_items block in YAML frontmatter.
- agents/prompts/_shared/rereview_block.md: shared re-review protocol snippet (single source of truth). Used when prior reviews exist FOR THIS specialist.
- agents/registry.yaml: prompt_version bumped 1.0.0 → 1.1.0 for all 13 paper_reviewer entries.

REVIEWER (T012):
- paper_reviewer.py handle_response: normalizes action_items emitted by the LLM (derives missing IDs via action_item_id()).

ACCEPTANCE GATE (T014-T017, US1):
- advancement.py: replaced "any-historical-accept" gate with most-recent non-stale verdict per specialist (FR-001/002/003). Stale-hash reviews are ignored. The redundant point threshold (PAPER_ACCEPT_THRESHOLD) is dropped for the all-accept condition: when every specialist's most-recent is accept, the project transitions to PAPER_ACCEPTED.

SEVERITY ROUTING (T018-T021, US4):
- advancement.py: max-severity across specialists drives routing.
- fatal → BRAINSTORMED with rejection rationale appended to the idea record (via upstream_feedback.append_rejection_rationale).
- writing / science → legacy MINOR/MAJOR revision stages for now (the auto-plan revision_planner is part of US2/US3, deferred to Phase 2).
- Back-compat: when records lack action_items (prompt_version 1.0.x), fall back to the pre-spec-012 _winning_recommendation. PROJ-578 / etc. continue to route correctly until they're re-reviewed under 1.1.0.

ARXIV-INTAKE GUARDRAIL (T040-T045, US7):
- New module src/llmxive/agents/upstream_feedback.py.
- is_arxiv_intake(project_dir): detects third-party arxiv submissions (metadata.json present AND paper/specs/ absent).
- record_round(...): atomically appends a Round to projects/<PROJ-ID>/upstream_feedback.yaml.
- append_rejection_rationale(...): annotates the idea record on BRAINSTORMED transition (best-effort; defensive).
- advancement.py routes arxiv-intake projects to PAPER_ACCEPTED (with caveats in upstream_feedback.yaml) or BRAINSTORMED; it NEVER attempts to mutate paper/source/.

SPEC ARTIFACTS:
- specs/012-paper-review-convergence/: spec.md, plan.md, research.md, data-model.md, quickstart.md, 4 contracts, checklists/requirements.md, tasks.md (55 tasks). /speckit-analyze produced 8 findings (1H/3M/2L); all 8 fixed in iteration 1.
- CLAUDE.md updated to point at the new plan.
- contracts/project-state.schema.yaml: 3 new stage values + revision_spec_path.

TESTS:
- 39 new unit tests across test_action_item_schema.py, test_review_record_action_items.py, test_advancement_convergence.py.
- Full unit suite (451 tests) passes.

DEFERRED to follow-up PRs:
- T022-T034: revision_planner (auto-plan 5-stage subprocess driver).
- T035-T039: paper_reviewer.py wiring the shared rereview snippet into the prompt when prior reviews exist (the snippet is created in this PR; the consumer logic is the follow-up).
- T046-T050: scheduler idempotency + unblock CLI + e2e convergence test.
- T052-T053: web dashboard rendering of new stage badges, README update.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
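
The acceptance gate described in this commit (most-recent non-stale verdict per specialist) might look roughly like the sketch below. The `Review` dataclass, its field names (in particular `artifact_hash`), and the verdict strings are hypothetical stand-ins for the real ReviewRecord. The empty-`required` case is deliberately left conservative here: the sketch simply refuses to advance.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, Sequence


@dataclass(frozen=True)
class Review:
    """Hypothetical minimal record; the real ReviewRecord carries more fields."""
    specialist: str
    verdict: str
    reviewed_at: datetime
    artifact_hash: str  # assumed name for the hash the review was made against


def most_recent_verdicts(reviews: Iterable[Review], current_hash: str) -> Dict[str, str]:
    """Map each specialist to the verdict of its latest non-stale review."""
    latest: Dict[str, Review] = {}
    for review in reviews:
        if review.artifact_hash != current_hash:
            continue  # stale-hash reviews are ignored
        previous = latest.get(review.specialist)
        if previous is None or review.reviewed_at > previous.reviewed_at:
            latest[review.specialist] = review
    return {name: review.verdict for name, review in latest.items()}


def all_accept(reviews: Iterable[Review], current_hash: str, required: Sequence[str]) -> bool:
    """True when every required specialist's most-recent non-stale verdict is accept."""
    verdicts = most_recent_verdicts(reviews, current_hash)
    return bool(required) and all(verdicts.get(name) == "accept" for name in required)
```

The point of keying on the latest review rather than any historical one is that an old accept no longer counts once the same specialist has issued a newer non-accept verdict, and vice versa.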

When a specialist reviewer has ≥1 prior review record for THIS project, paper_reviewer.py now prepends the shared re-review block (from agents/prompts/_shared/rereview_block.md) to the user prompt, with the specialist's most-recent prior action_items substituted in. The block instructs the LLM to apply the two-question protocol (FR-014/015/016) instead of generating a fresh critique.

A specialist with NO prior records continues to use the full-critique prompt (FR-017). This is the per-specialist toggle from clarification session Q2.

Changes:
- src/llmxive/state/reviews.py: prior_reviews_for_specialist() filters list_for() output to one specialist + sorts by reviewed_at ascending.
- src/llmxive/agents/paper_reviewer.py build_messages: when prior reviews exist FOR THIS specialist, render the shared snippet with the most-recent prior's action_items as YAML, prepend it to the user prompt.
- contracts/review-record.schema.yaml: action_items array added so old-record-validation doesn't reject the new field on serialization.
- tests/unit/test_rereview_per_specialist_toggle.py: 7 new tests covering per-specialist filtering, sort order, snippet presence, no-priors path.

Full unit suite (458 tests, +7) still passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
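
A rough sketch of the per-specialist toggle follows. Only the name `prior_reviews_for_specialist` comes from this commit; its signature here, the minimal `ReviewRecord` tuple, the `build_user_prompt` helper, and the `{prior_action_items}` placeholder are hypothetical, and a plain bullet list stands in for the YAML rendering.

```python
from datetime import datetime
from typing import Iterable, List, NamedTuple


class ReviewRecord(NamedTuple):
    """Hypothetical minimal record for illustration only."""
    specialist: str
    reviewed_at: datetime
    action_items: List[str]


def prior_reviews_for_specialist(
    records: Iterable[ReviewRecord], specialist: str
) -> List[ReviewRecord]:
    """Keep one specialist's records, sorted by reviewed_at ascending."""
    mine = [r for r in records if r.specialist == specialist]
    return sorted(mine, key=lambda r: r.reviewed_at)


def build_user_prompt(
    base_prompt: str,
    rereview_template: str,
    records: Iterable[ReviewRecord],
    specialist: str,
) -> str:
    """Prepend the re-review block only when this specialist has prior reviews."""
    priors = prior_reviews_for_specialist(records, specialist)
    if not priors:
        return base_prompt  # no priors: full-critique prompt unchanged
    items = "\n".join(f"- {text}" for text in priors[-1].action_items)
    block = rereview_template.replace("{prior_action_items}", items)
    return block + "\n\n" + base_prompt
```

Because the filter is per specialist, a reviewer that is new to the project still produces a full critique even when other specialists are already on their second round.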

Adds the operator escape hatch and scheduler skip rules required by spec 012:

- src/llmxive/cli.py: new subcommand `llmxive project unblock <PROJ-ID>` (FR-023). Refuses to no-op-unblock: requires the most-recent state/revisions/<PROJ-ID>/round-N.yaml file to be modified AFTER the project's recorded updated_at (mtime check). Transitions to PAPER_REVIEW by default; --to-minor transitions to PAPER_MINOR_REVISION.
- src/llmxive/pipeline/scheduler.py: PAPER_REVISION_IN_PROGRESS, READY_FOR_IMPLEMENTATION, and PAPER_REVISION_BLOCKED added to _NEVER_PICK. FR-009's idempotency rule: while a project is being planned, the regular scheduler MUST NOT re-trigger work on it. The ready/blocked states are owned by dedicated agents (implementer + human respectively), not the regular tick-scheduler.
- tests/unit/test_cli_project_unblock.py: 5 tests covering happy path, --to-minor flag, no-op-unblock refusal, wrong-stage refusal, missing round-file refusal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
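
The no-op-unblock refusal rests on an mtime comparison, which could be sketched like this. The helper name and the choice to compare in UTC with a timezone-aware `updated_at` are assumptions; the actual CLI code may compare timestamps differently.

```python
from datetime import datetime, timezone
from pathlib import Path


def round_file_modified_after(round_file: Path, updated_at: datetime) -> bool:
    """True when the round file's mtime is strictly later than updated_at.

    Assumes updated_at is timezone-aware; a naive datetime would raise TypeError.
    """
    mtime = datetime.fromtimestamp(round_file.stat().st_mtime, tz=timezone.utc)
    return mtime > updated_at
```

A caller would refuse the unblock when this returns False, since an untouched round file suggests the operator has not actually changed anything since the project was blocked.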

3 integration tests in tests/integration/test_revision_in_progress_idempotency.py:
- verify the three spec-012 stages are in _NEVER_PICK
- verify a runnable project is preferred over an in-progress one
- verify the scheduler returns None when every project is in a NEVER_PICK state

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

When a home-grown paper enters PAPER_REVIEW with writing/science action items (no fatal), advancement.py now transitions the project to PAPER_REVISION_IN_PROGRESS and invokes revision_planner.run_revision_pipeline. The planner produces a full revision-spec directory under specs/auto-revisions/<PROJ-ID>/round-<N>/ containing spec.md, plan.md, tasks.md, analyze-report.md, and result.yaml.

Implementation is DETERMINISTIC (v1): each of the 5 stage outputs is generated directly from the consolidated action items (no LLM call). The spec/plan/tasks artifacts are concrete enough that an implementer agent can pick up the revision_spec_path and execute. A follow-up PR replaces the deterministic generation with the full LLM-driven speckit pipeline (speckit-{specify,clarify,plan,tasks,analyze}).

Public API contract is stable across v1 (deterministic) and v2 (LLM-driven):

    run_revision_pipeline(project_id, action_items, *, revision_kind, repo_root) -> RevisionSpecResult{revision_spec_path, final_outcome, stage_results, ...}

Defensive checks:
- ArxivIntakeError on arxiv-intake projects (advancement.py routes them through upstream_feedback instead).
- RevisionPlanningError on FS/schema failures.
- On either error, advancement.py transitions to PAPER_REVISION_BLOCKED so the operator notices.

state/revisions/index.yaml is also updated atomically so an implementer agent can discover ready-for-implementation projects without scanning the filesystem.

8 new unit tests in tests/unit/test_revision_planner.py cover the 5-artifact generation, action-item-to-task mapping, arxiv-intake guardrail, science vs writing kinds, round-number incrementing, and the index.yaml update.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
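
The round-number incrementing mentioned above might work along these lines. This is a sketch under stated assumptions: the helper name `next_round_dir` is hypothetical, rounds are assumed to start at 1, and only directories named `round-<N>` are counted.

```python
import re
from pathlib import Path

_ROUND_DIR = re.compile(r"^round-(\d+)$")


def next_round_dir(repo_root: Path, project_id: str) -> Path:
    """Return specs/auto-revisions/<project_id>/round-<N> for the next unused N."""
    base = repo_root / "specs" / "auto-revisions" / project_id
    existing = []
    if base.is_dir():
        for child in base.iterdir():
            match = _ROUND_DIR.match(child.name)
            if match and child.is_dir():
                existing.append(int(match.group(1)))
    # Assumption: the first round is numbered 1.
    return base / f"round-{max(existing, default=0) + 1}"
```

Taking the maximum existing number rather than counting directories keeps the result monotonic even if an earlier round directory has been deleted.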

… (T052)

- web_data.py: PAPER_REVISION_IN_PROGRESS, READY_FOR_IMPLEMENTATION, PAPER_REVISION_BLOCKED added to _PHASE_GROUP_BY_STAGE (all → paper_review phase). Without this, projects landing in the new states would be rendered as "blocked" (the fallback group), which is misleading.
- _project_to_entry payload gains revision_spec_path (links to the auto-planned revision spec dir when stage == READY_FOR_IMPLEMENTATION) and upstream_feedback (summary of the arxiv-intake annotation).
- _upstream_feedback_summary() reads upstream_feedback.yaml and returns {schema_version, round_count, latest_verdict_class, latest_action_item_count}. None when the file is absent (most projects).

Regenerates web/data/projects.json with the new fields.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…gate (T050)

Adds the end-to-end convergence test required by SC-001 / T050. The test covers the three terminal outcomes:
- All specialists accept → PAPER_ACCEPTED
- Writing-class action items → PAPER_REVISION_IN_PROGRESS → 5 artifacts + READY_FOR_IMPLEMENTATION
- Fatal-class action items → BRAINSTORMED + rejection rationale appended to the idea record

Gated on LLMXIVE_REAL_TESTS=1 per the real-call test convention. The test exercises pure-Python logic + real filesystem state (no Dartmouth calls needed; the deterministic revision_planner emits artifacts directly).

ALSO fixes a defensive bug in _all_specialists_accept_most_recent: previously, when `required` was empty (registry-load failure), the gate trivially returned True, which meant any non-accept review on an unconfigured registry would be incorrectly routed to PAPER_ACCEPTED. New behavior:
- empty required + no records → False (unconfigured; refuse to advance)
- empty required + all-accept records → True (every reviewer that recorded a verdict accepted; vacuously OK)
- empty required + any non-accept → False (severity branch takes over)
- non-empty required + records → standard per-specialist most-recent check

Two unit tests added in test_advancement_convergence.py to lock the new behavior in (replacing the prior single test_empty_required_gate_passes_trivially).

Full unit suite (463+e2e) passes locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
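
The four-case behavior for an empty `required` list can be written down compactly. The sketch below drops the leading underscore and takes an already-computed mapping of specialist to most-recent verdict, which is a simplification of whatever the real `_all_specialists_accept_most_recent` receives; the verdict strings are illustrative.

```python
from typing import Mapping, Sequence


def all_specialists_accept_most_recent(
    required: Sequence[str], most_recent: Mapping[str, str]
) -> bool:
    """Gate on most-recent verdicts, without passing trivially on an empty registry."""
    if not required:
        if not most_recent:
            return False  # unconfigured and no records: refuse to advance
        # Every reviewer that recorded a verdict must have accepted.
        return all(verdict == "accept" for verdict in most_recent.values())
    # Standard check: each required specialist's most-recent verdict is accept.
    return all(most_recent.get(name) == "accept" for name in required)
```

The explicit empty-records branch matters because `all()` over an empty iterable is True in Python, which is exactly the trivial pass the fix removes.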

jeremymanning force-pushed the 012-paper-review-convergence branch from f292689 to 280f8b6 Compare May 18, 2026 12:45

jeremymanning and others added 8 commits May 18, 2026 09:14

jeremymanning force-pushed the 012-paper-review-convergence branch from 280f8b6 to 5cbbdda Compare May 18, 2026 13:16

jeremymanning merged commit 8cdbfbd into main May 18, 2026
4 of 5 checks passed

jeremymanning deleted the 012-paper-review-convergence branch May 18, 2026 13:16
